Measuring inter-rater reliability for nominal data – which coefficients and confidence intervals are appropriate?
Authors
Abstract
BACKGROUND: Reliability of measurements is a prerequisite of medical research. For nominal data, Fleiss' kappa (hereafter Fleiss' K) and Krippendorff's alpha are the most flexible of the available reliability measures with respect to the number of raters and categories. Our aim was to investigate which measures and which confidence intervals provide the best statistical properties for the assessment of inter-rater reliability in different situations.
METHODS: We performed a large simulation study to investigate the precision of the estimates for Fleiss' K and Krippendorff's alpha and to determine the empirical coverage probability of the corresponding confidence intervals (asymptotic for Fleiss' K and bootstrap for both measures). Furthermore, we compared measures and confidence intervals in a real-world case study.
RESULTS: Point estimates of Fleiss' K and Krippendorff's alpha did not differ from each other in any scenario. In the case of missing data (missing completely at random), Krippendorff's alpha provided stable estimates, while the complete-case analysis approach for Fleiss' K led to biased estimates. For shifted null hypotheses, the coverage probability of the asymptotic confidence interval for Fleiss' K was low, while the bootstrap confidence intervals for both measures provided a coverage probability close to the nominal level.
CONCLUSIONS: Fleiss' K and Krippendorff's alpha with bootstrap confidence intervals are equally suitable for the analysis of reliability of complete nominal data. The asymptotic confidence interval for Fleiss' K should not be used. In the case of missing data or data of higher than nominal order, Krippendorff's alpha is recommended. Together with this article, we provide an R script for calculating Fleiss' K and Krippendorff's alpha and their corresponding bootstrap confidence intervals.
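To make the two coefficients and the bootstrap interval concrete, here is a minimal Python sketch. It is not the article's R script: the function names, the toy `ratings` data, and the choice of a percentile bootstrap that resamples subjects are all assumptions made for illustration. Ratings are stored as one list per subject, and `None` marks a missing rating (used only by the alpha function).

```python
import random
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' K for complete data (same number of raters for every subject)."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()   # how often each category was assigned overall
    agreement = 0.0      # sum of per-subject agreement proportions
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        agreement += sum(c * (c - 1) for c in counts.values()) / (
            n_raters * (n_raters - 1)
        )
    p_obs = agreement / n_subjects
    p_exp = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    return (p_obs - p_exp) / (1 - p_exp)


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data; None marks a missing rating."""
    totals = Counter()
    disagreement = 0.0   # off-diagonal mass of the coincidence matrix
    for row in ratings:
        counts = Counter(v for v in row if v is not None)
        m = sum(counts.values())
        if m < 2:
            continue     # a subject with fewer than two ratings is not pairable
        totals.update(counts)
        disagreement += (m * m - sum(c * c for c in counts.values())) / (m - 1)
    n = sum(totals.values())
    d_obs = disagreement / n
    d_exp = (n * n - sum(t * t for t in totals.values())) / (n * (n - 1))
    return 1 - d_obs / d_exp


def bootstrap_ci(ratings, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling subjects with replacement (assumed scheme)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        try:
            estimates.append(statistic(sample))
        except ZeroDivisionError:
            continue     # degenerate resample, e.g. only one category present
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lower, upper


# Hypothetical toy data: 10 subjects, 3 raters, categories "a", "b", "c".
ratings = [
    ["a", "a", "a"], ["a", "a", "b"], ["b", "b", "b"], ["c", "c", "c"],
    ["a", "b", "c"], ["b", "b", "c"], ["c", "c", "a"], ["a", "a", "a"],
    ["b", "b", "b"], ["c", "c", "c"],
]

print("Fleiss' K:", fleiss_kappa(ratings), bootstrap_ci(ratings, fleiss_kappa))
print(
    "Krippendorff's alpha:",
    krippendorff_alpha_nominal(ratings),
    bootstrap_ci(ratings, krippendorff_alpha_nominal),
)
```

For complete data the two functions are tied by 1 - alpha = (1 - K)(T - 1)/T, where T is the total number of ratings, so the point estimates converge as the sample grows. With missing ratings, only `krippendorff_alpha_nominal` in this sketch can use the partially rated subjects; `fleiss_kappa` would first require dropping every subject with a missing rating.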
Similar resources
Variance Estimation of Nominal-scale Inter-rater Reliability with Random Selection of Raters
Most inter-rater reliability studies using nominal scales suggest the existence of two populations of inference: the population of subjects (collection of objects or persons to be rated) and that of raters. Consequently, the sampling variance of the inter-rater reliability coefficient can be seen as a result of the combined effect of the sampling of subjects and raters. However, all inter-rater...
Reliability of Body Landmarks Analyzer for Measuring the Quadriceps Angle
Genu varum and genu valgum are the most common postural deformities of the knee joint. The quadriceps angle is used to measure these anomalies. Methods of measuring this angle are divided into two categories: invasive and non-invasive. The purpose of the present research was to study the inter- and intra-rater reliability of the non-invasive Body Landmarks Analyzer method for measuring the quadriceps...
A Comparison of Cohen's Kappa and Agreement Coefficients by Corrado Gini
The paper compares four coefficients that can be used to summarize inter-rater agreement on a nominal scale. The coefficients are Cohen's kappa and three coefficients that were originally proposed by the Italian statistician Corrado Gini. All four coefficients have zero value if the two nominal variables are statistically independent, and value unity if there is perfect agreement. The coefficie...
Confidence Intervals for Intraclass Correlation in Inter-Rater Reliability
Calculation of a confidence interval for intraclass correlation to assess inter-rater reliability is problematic when the number of raters is small and the rater effect is not negligible. Intervals produced by existing methods are uninformative: the lower bound is often close to zero, even in cases where the reliability is good and the sample size is large. In this paper, we show that t...
Psychometric properties of the Portuguese version of the Jebsen-Taylor test for adults with mild hemiparesis / Avaliação das propriedades psicométricas da versão em português do teste de Jebsen Taylor para adultos com hemiparesia leve
Objectives: To evaluate the psychometric properties of the Portuguese version of the Jebsen-Taylor Test (JTT) in patients with stroke. Methods: Forty participants who suffered a stroke in the cerebral hemisphere were videotaped while performing the JTT. Scores were defined by the time taken to perform the tasks, and two physical therapists evaluated the performance of the participants. Intra- and...
Journal title:
Volume 16, issue
Pages -
Publication date: 2016